Abstract. - We have investigated the origin of fluctuations in the aggregated behaviour of an 
open-source software community. In a recent series of papers [1-3], de Menezes and co-workers 
have shown how to separate internal dynamics from external fluctuations by capturing the 
simultaneous activity of many system components. Although software development is a 
planned activity, the analysis of fluctuations reveals that external driving forces can only be 
observed at weekly and longer time scales. Hourly and higher change frequencies mostly relate 
to internal maintenance activities. There is a crossover from endogenous to exogenous activity 
depending on the average number of file changes. This new evidence suggests that software 
development is a non-homogeneous design activity where stronger efforts focus on a few project 
files. The crossover can be explained with a Langevin equation associated with the cascading 
process, where changes to any file trigger additional changes to its neighbours in the software 
network. In addition, the analysis of fluctuations enables us to detect whether a software system 
can be decomposed into several subsystems with different development dynamics. 



Multiple time series are available for complex systems whose dynamics is the outcome of 
a large number of agents interacting through a complex network. Recent measurements of 
the fluctuations at network nodes [1-4] indicate a power-law scaling between the mean 
$\langle f_i \rangle$ and the standard deviation 
$\sigma_i = \sqrt{\langle (f_i - \langle f_i \rangle)^2 \rangle}$ of the time-dependent activity 
$f_i(t)$ of node $i = 1, \ldots, N$, that is, 

$$\sigma_i \sim \langle f_i \rangle^{\alpha} \qquad (1)$$

where $\alpha$ is an exponent which can take values between 1/2 and 1 [1]. It seems that 
real systems fall into two different classes depending on the value of this 
exponent. Systems with internal (or endogenous) dynamics, like the physical Internet and 
electronic circuits, show the exponent $\alpha = 1/2$. On the other hand, systems either involving 
human interactions (e.g., the WWW and highway traffic) or strongly influenced by external forces 
(e.g., rivers) belong to the class defined by the universal exponent $\alpha = 1$. Interestingly, 
some systems display both types of behaviour when analysed at different scales of detail. 
For example, visits to web pages and routing of data packets in the Internet are dynamical 
processes with different origins [1]. The former process is driven by user demand while 
the latter accounts for a significant amount of internal activity even in the absence of human 
interaction (e.g., routing protocols). 
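
As a rough illustration of eq. (1), the exponent can be estimated by a least-squares fit in log-log space. The sketch below is not the procedure used in [1-4]; the function name `fluctuation_exponent` and the synthetic data are assumptions made for this example. Independent Poisson counts stand in for purely internal dynamics, and a common random drive shared by all nodes stands in for purely external dynamics.

```python
import numpy as np

def fluctuation_exponent(activity):
    """Estimate alpha in sigma_i ~ <f_i>^alpha by a log-log linear fit.

    activity: array of shape (N, T) with activity[i, t] = f_i(t).
    Nodes with zero mean or zero deviation are skipped (their log is undefined).
    """
    activity = np.asarray(activity, dtype=float)
    mean = activity.mean(axis=1)
    std = activity.std(axis=1)
    keep = (mean > 0) & (std > 0)
    slope, _intercept = np.polyfit(np.log10(mean[keep]), np.log10(std[keep]), 1)
    return slope

rng = np.random.default_rng(0)
rates = np.logspace(-1, 2, 50)  # hypothetical mean activities of 50 nodes

# Internal dynamics (assumption): independent Poisson counts, sigma = sqrt(mean).
internal = rng.poisson(rates[:, None], size=(50, 5000))
alpha_internal = fluctuation_exponent(internal)

# External dynamics (assumption): every node follows the same global drive.
drive = rng.uniform(0.5, 1.5, size=5000)
external = rates[:, None] * drive[None, :]
alpha_external = fluctuation_exponent(external)

print(f"internal: {alpha_internal:.2f}, external: {alpha_external:.2f}")
```

In this toy setting the first fit comes out close to 1/2 and the second equals 1 up to rounding, which mirrors the two classes described above.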

Here, we apply this theoretical framework for the first time to the analysis of human 
dynamics observed in open-source software development, which is an important activity with 
economic and social implications. Open-source software (OSS) [4] often requires the collective 
efforts of a large number of experienced programmers (also called developers or software 
engineers). How individual expertise and social organization combine to yield a complex 
and reliable software system is still largely unknown. Interestingly, many remarkable features 
of OSS cannot be detected in the activity of single programmers [6]. This suggests that, in 
order to understand how OSS development takes place, the activities of many developers must be 
studied simultaneously. A prerequisite for studying processes of software change is to understand 
the social organization of OSS. These communities combine two groups of people in a hierarchical 
or onion-like structure: (1) an inner team of software developers that develop and maintain 
the source code files and (2) the potentially larger community of software users (see fig. 1A). 
This group of users triggers new development activities by issuing modification requests. In 
addition, every software change has a non-zero probability of injecting new software defects, 
which in turn may trigger a cascade of repair changes [7]. 

We look at software development as a sequence of software change events. Previous studies 
on software maintenance dynamics proposed a classification of changes into categories associated 
with different project stages [8]. These studies reported the frequency of every type of 
change. However, the software database analyzed here (see below) does not indicate whether a 
change addresses a user request or not. Instead, we suggest how the analysis of fluctuations 
can be used to obtain this information. We propose a new classification of software changes as 
exogenous or endogenous depending on whether or not the change is independent of previous 
events. Because changes requested by users are independent of each other [8], we will refer 
to them as "exogenous". On the other hand, cascades of correlated changes are "endogenous" 
(see fig. 1A). In a related paper, Sornette and co-authors make a distinction between endogenous 
and exogenous events in the context of book sales [9]. It was shown that exogenous and 
endogenous sales peaks have different relaxation dynamics. 

Data. - Detailed activity registers of the OSS community reside in centralized source code 
repositories, like the Concurrent Version System (CVS) [5]. During the process of software 
change, developers access files to add, change or remove one or more lines of source code. 
The CVS database tracks each file revision submitted by a developer. The activity of many 
developers progresses in parallel with simultaneous changes to many files. However, the CVS 
system provides some mechanisms to ensure that any given file cannot be changed by more 
than one developer simultaneously. In addition, the CVS stores all source code files required 
to build the software system. We have shown that this set of project files describes a complex 
network with an asymmetric scale-free architecture [12]. Following [12], we can reconstruct 
this software network $G = (V, E)$ from the collection of source code files, where each node 
$v_i \in V$ represents a single source file and the link $(v_i, v_j) \in E$ indicates a compile-time 
dependency between files $v_i$ and $v_j$ (see fig. 1A). It can be shown that the number of links 
$L(t)$ grows logarithmically with the number of nodes $N(t)$ in the software network [12]. 
Our analysis combines structural information provided by the software network with the time 
series of file changes stored in the CVS. We have validated our results with several software 
projects [14]. 
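
To make the definition of $G = (V, E)$ concrete, the following minimal sketch builds a directed dependency network from a mapping of files to the files they depend on. It is only an illustration: the extraction procedure of [12] is not reproduced here, and the function `build_dependency_network`, the link direction (from the depending file to the file it depends on) and the toy file names are assumptions.

```python
from collections import defaultdict

def build_dependency_network(includes):
    """Build node and link sets from a {file: [files it depends on]} mapping.

    A link (src, dst) means that src has a compile-time dependency on dst.
    Returns the node set, the link set and the in/out degree of each node.
    """
    nodes = set(includes)
    edges = set()
    for src, deps in includes.items():
        for dst in deps:
            nodes.add(dst)
            edges.add((src, dst))
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for src, dst in edges:
        out_degree[src] += 1
        in_degree[dst] += 1
    return nodes, edges, dict(in_degree), dict(out_degree)

# Hypothetical toy project, not taken from any of the analysed repositories.
toy_includes = {
    "main.c": ["util.h", "net.h"],
    "net.c": ["net.h", "util.h"],
    "util.c": ["util.h"],
}
nodes, edges, in_deg, out_deg = build_dependency_network(toy_includes)
print(len(nodes), len(edges), in_deg["util.h"])
```

Keeping in-degree and out-degree separate is a deliberate choice here, since the text describes the architecture as asymmetric.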

Fig. 1 - (A) Schematic representation of an OSS community (see text). Scaling of fluctuations 
$\sigma_i \sim \langle f_i^{\Delta t} \rangle^{\alpha}$ with average change activity for the software 
project XFree86, measured at different time resolutions: (B) $\Delta t = 6$ hours and 
(C) $\Delta t = 9600$ hours. (D) shows the dependence of the exponent $\alpha$ on the time window 
$\Delta t$. Here, the exponent $\alpha$ grows from 0.51 to 0.92. 



Analysis of fluctuations. - We have analysed the aggregated activity of software developers 
at different timescales. Given a fixed measurement time window $\Delta t$, we measure 
development activity by looking at the dynamics of single file changes: 

$$f_i^{\Delta t}(t) = \sum_{\tau \in [t, t + \Delta t]} c_i(\tau) \qquad (2)$$

where $c_i(t) = 1$ when file $v_i$ has been changed at time $t$ and $c_i(t) = 0$ otherwise. Notice 
how eq. (2) corresponds to the coarse-graining of the time series of file change events. In the 
following, we will omit the superscript $\Delta t$ whenever the timescale is implicit. We also define 
the global activity $F^{\Delta t}(t)$, or the number of project changes at time $t$: 

$$F^{\Delta t}(t) = \sum_{i=1}^{N} f_i^{\Delta t}(t). \qquad (3)$$

In figure 1 we display the scaling of fluctuations with the average activity (see eq. (1)) 
in a software project at different time scales. The scaling exponent depends on 
the time window $\Delta t$. The observed exponent is less than 1 for a wide range of time 
scales (see fig. 1B), thus suggesting an endogenous origin of development activity. On the other 
hand, the analysis of fluctuations in various OSS projects at monthly and larger time scales 
yields an exponent closer to 1 (see fig. 1C). The external driving force becomes stronger when 
$\Delta t$ increases. In the following, we further investigate the origin of fluctuations in software 
development dynamics with a more robust measure. 
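
The coarse-graining of change events into windowed activities can be sketched as follows. This is a minimal example under stated assumptions: change events are given as (file index, timestamp) pairs, timestamps are in hours, and the windows are non-overlapping, which is one simple way to sample the per-file activity. The function name `coarse_grain` and the toy event list are hypothetical.

```python
import numpy as np

def coarse_grain(events, n_files, t_end, dt):
    """Count the changes of each file in consecutive windows of length dt.

    events: iterable of (file_index, timestamp) with 0 <= timestamp < t_end.
    Returns an integer array f of shape (n_files, n_windows), where f[i, w]
    is the number of changes of file i in window w.
    """
    n_windows = int(np.ceil(t_end / dt))
    f = np.zeros((n_files, n_windows), dtype=int)
    for file_index, timestamp in events:
        f[file_index, int(timestamp // dt)] += 1
    return f

# Hypothetical change log: (file index, time of change in hours).
events = [(0, 0.5), (0, 1.2), (1, 1.9), (0, 7.0), (2, 11.5)]

f_6h = coarse_grain(events, n_files=3, t_end=12, dt=6)
F_6h = f_6h.sum(axis=0)  # global activity: total project changes per window

print(f_6h.tolist())  # [[2, 1], [1, 0], [0, 1]]
print(F_6h.tolist())  # [3, 2]
```

Repeating the call with a different `dt` gives the same event log at another time resolution, which is how the dependence of the exponent on the window length could be explored.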

Crossover in internal dynamics. - We can determine if OSS dynamics has an endogenous 
or exogenous origin by separating internal and external contributions [2]. We split the 
time series of individual file changes $f_i(t)$ into two different components: (i) internal 
fluctuations $f_i^{int}(t)$ governed by local interaction rules and (ii) external fluctuations 
$f_i^{ext}(t)$ caused by environmental variations, that is, 

Fig. 2 - (A) Crossover observed in the scaling of internal fluctuations with average flux, around 
$10^{-2}$, for the Apache project. In (B) and (C) we show the binned distribution of ratios for the 
two project file subsets $\langle f \rangle < 10^{-2}$ (open circles) and $\langle f \rangle > 10^{-2}$ 
(black circles), respectively. In all plots, $\Delta t = 10$ hours. 



$$f_i(t) = f_i^{int}(t) + f_i^{ext}(t) \qquad (4)$$

where the external activity $f_i^{ext}(t)$ represents the expected fraction of changes shared by 
file $v_i$: 

$$f_i^{ext}(t) = A_i \sum_{j=1}^{N} f_j(t) \qquad (5)$$

Here $A_i$ is the file centrality [2], defined as the overall fraction of changes received by the file, 

$$A_i = \frac{\sum_{t=1}^{T} f_i(t)}{\sum_{t=1}^{T} \sum_{j=1}^{N} f_j(t)} \qquad (6)$$

and $T$ is the timespan of software development. Notice that file centrality $A_i$ is independent 
of the observation window $\Delta t$. By definition, external fluctuations always scale linearly with 
the average number of file changes, $\sigma^{ext} \sim \langle f \rangle$. On the other hand, the 
exponent $\alpha$ governing the scaling of internal fluctuations with average flux, 
$\sigma^{int} \sim \langle f \rangle^{\alpha}$, indicates whether the dynamics has an 
endogenous ($\alpha = 0.5$) or exogenous ($\alpha = 1$) origin. Interestingly, we observe a crossover in 
the internal activity of open-source software development depending on the average number 
of file changes $\langle f \rangle$ (see fig. 2A). The crossover is less visible at large time scales $\Delta t$. 
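
The decomposition into internal and external parts can be written in a few lines once the windowed activities are stored in a matrix. The sketch below assumes an array `f` of shape (files, windows) and follows the definition of file centrality as a file's overall share of all changes; the function name `split_internal_external` and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def split_internal_external(f):
    """Split activities f[i, t] into internal and external components.

    centrality[i] is the overall fraction of changes received by file i;
    the external part is that share of the total activity in each window,
    and the internal part is the remainder.
    """
    f = np.asarray(f, dtype=float)
    total_per_window = f.sum(axis=0)      # activity summed over all files
    centrality = f.sum(axis=1) / f.sum()  # overall share of each file
    f_ext = centrality[:, None] * total_per_window[None, :]
    f_int = f - f_ext
    return f_int, f_ext, centrality

# Hypothetical activity matrix: 3 files observed over 3 windows.
f = np.array([[2, 1, 3],
              [1, 0, 1],
              [0, 1, 1]])
f_int, f_ext, A = split_internal_external(f)

print(A)                  # shares of the 10 recorded changes
print(f_int.sum(axis=1))  # internal parts sum to zero over the whole timespan
```

A useful sanity check on any implementation is that the centralities add up to one and that each file's internal component has zero mean over the full observation period, which follows directly from the definitions.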

The analysis of single node fluctuations provides additional evidence for this crossover. 
The ratio $\eta_i = \sigma_i^{ext} / \sigma_i^{int}$ between external and internal fluctuations 
indicates whether node dynamics is external ($\eta_i \gg 1$) or internal ($\eta_i \ll 1$). In order to 
characterize the system's overall behavior, we can compute the distribution of ratios $P(\eta_i)$ [2]. 
This measure was shown to be robust to variations in the measurement time window $\Delta t$. For 
example, figure 2 displays the distribution of ratios $P(\eta_i)$ measured in two different subsets 
of files in the Apache project. 

Fig. 3 - Measuring internal propagation of changes in the Apache project. (A) The cumulative total 
activity distribution $P_>(f_i)$ is broad scale. (B) Scaling of average activity/flux with node degree, 
$k_i \sim \langle f \rangle^{\beta}$, with $\beta \approx 0.34$. (C) Average neighbors activity 
$\langle f_{nn} \rangle$ scales with average node activity for $\langle f \rangle < f_0$ and then 
saturates, $\langle f_{nn} \rangle = \mathrm{const}$, once the crossover $f_0 = 10^{-2.5}$ is reached, 
$\langle f \rangle > f_0$. In order to reduce the noise, data have been logarithmically binned in 
plots (B) and (C). The measurement window is $\Delta t = 10$ hours. 

On the one hand, we can see that $P(\eta_i)$ is peaked around 0.55 (see fig. 2C) for the subset 
of files with $\langle f \rangle > 10^{-2}$. This suggests exogenous activity in a core set of project 
files (those depicted with black circles in fig. 2A and fig. 2C). On the other hand, $P(\eta_i)$ is 
skewed towards lower ratios (around 0.1) for project files with $\langle f \rangle < 10^{-2}$ (white 
circles in fig. 2A and fig. 2B), so activities involving less frequently changed files have an 
endogenous origin (see fig. 2B). 
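
The ratio of external to internal fluctuations can be computed per file from the same activity matrix. The following sketch is self-contained and uses synthetic data rather than any repository: each hypothetical file receives Poisson-distributed changes whose rate follows a common random drive, and the names `fluctuation_ratios`, `drive` and `shares` are assumptions introduced for this example.

```python
import numpy as np

def fluctuation_ratios(f):
    """Return the ratio sigma_ext / sigma_int for every row of f[i, t].

    Files whose internal component does not fluctuate get NaN.
    """
    f = np.asarray(f, dtype=float)
    centrality = f.sum(axis=1) / f.sum()
    f_ext = centrality[:, None] * f.sum(axis=0)[None, :]
    f_int = f - f_ext
    sigma_ext = f_ext.std(axis=1)
    sigma_int = f_int.std(axis=1)
    eta = np.full(f.shape[0], np.nan)
    ok = sigma_int > 0
    eta[ok] = sigma_ext[ok] / sigma_int[ok]
    return eta

rng = np.random.default_rng(1)
drive = rng.uniform(0.5, 1.5, size=2000)  # common external modulation (assumed)
shares = np.array([5.0, 20.0, 80.0])      # hypothetical mean change rates
f = rng.poisson(shares[:, None] * drive[None, :])

eta = fluctuation_ratios(f)
print(eta)
```

In this toy model the ratio grows with the mean activity of the file, because the counting noise of a file grows more slowly than its share of the common drive. A histogram of `eta` over all files (for example with `np.histogram` on logarithmic bins) would then play the role of the distribution of ratios discussed above.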

Propagation of Changes. - The crossover in internal fluctuations stems from the inhomogeneous 
nature of software development. A large share of the development effort is directed at a small 
number of core files, which change more frequently than other project files. In a related paper, 
network heterogeneity was shown to have an impact on the dynamics of diffusion processes [3]. When 
the diffusive process is multiplicative and the underlying topology is intrinsically inhomogeneous, 
there is a crossover from $\alpha = 0.5$ to $\alpha = 1$ in the scaling of fluctuations with the average 
flux (eq. (1)). Such diffusive network processes can be modeled through the Langevin equation 
by a mean-field approximation [3]. The change of mass at node $i$ during a unit time interval 
is: 



$$f_i(t + 1) = f_i(t) + \sum_{j=1}^{k_i} \frac{1}{k_j}\, \eta_j(t)\, f_j(t) \qquad (7)$$

where the second term represents the incoming mass from the nearest neighbors and $\eta_j(t)$ 
is a uniform random variable (i.e., a multiplicative noise term). Because we are focusing on the 
internal diffusion process, we do not take into account additional terms like outgoing mass 
and/or uncorrelated Gaussian noise. This type of diffusion process displays a characteristic 
scaling in the probability distribution, $P(f_i) \sim f_i^{-1-\mu}$ [3]. The continuous approximation 
of the previous equation is 

$$\frac{d f_i}{dt} \approx \sum_{j=1}^{k_i} \frac{1}{k_j}\, \eta_j(t)\, f_j(t) = \frac{k_i}{\langle k_{nn} \rangle} \langle \eta_j(t) f_j(t) \rangle \qquad (8)$$
where $\langle k_{nn} \rangle$ denotes the average degree of a node's nearest neighbors. Because 
$\eta_j(t)$ and $f_j(t)$ are independent variables, and assuming that $\langle k_{nn} \rangle$ is a 
function of $\langle k \rangle$, the average on the right-hand side factorizes as given in eq. (9), 
where $\langle f_{nn} \rangle$ indicates the average incoming mass in the nearest neighbors of a node. 
For the Barabási-Albert network [13], the numerical solution of eq. (9) shows that 
$\langle f_{nn} \rangle$ decreases as $\langle f \rangle$ increases and then saturates to a constant 
value for $\langle f \rangle > f_0$ (see [3] for details). The observed value of $f_0$ indicates the 
crossover between endogenous and exogenous dynamics. 
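
The diffusion process can be explored numerically on a small heterogeneous network. The sketch below is not the simulation of [3]: it assumes that, at each step, every node receives from each neighbour $j$ the amount $\eta_j(t) f_j(t) / k_j$ with $\eta_j$ uniform in $[0, 1)$ and that no mass leaves a node, so the total mass grows in time. The network generator is a simple preferential-attachment growth rule written for this example, and all function names are hypothetical.

```python
import numpy as np

def preferential_attachment_graph(n, m, rng):
    """Grow a network where each new node links to m distinct existing
    nodes chosen with probability proportional to their degree."""
    neighbors = {i: set() for i in range(n)}
    pool = []  # every node appears here once per link end it holds
    for i in range(m + 1):  # seed: complete graph on m + 1 nodes
        for j in range(i + 1, m + 1):
            neighbors[i].add(j)
            neighbors[j].add(i)
            pool += [i, j]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(pool[rng.integers(len(pool))])
        for old in chosen:
            neighbors[new].add(old)
            neighbors[old].add(new)
            pool += [new, old]
    return neighbors

def diffuse(neighbors, steps, rng):
    """Iterate the assumed update: each node gains the noisy mass sent by
    its neighbours, where neighbour j sends eta_j * f_j / k_j per link."""
    n = len(neighbors)
    degree = np.array([len(neighbors[i]) for i in range(n)], dtype=float)
    f = np.ones(n)
    for _ in range(steps):
        eta = rng.uniform(0.0, 1.0, size=n)
        sent = eta * f / degree
        incoming = np.array([sum(sent[j] for j in neighbors[i]) for i in range(n)])
        f = f + incoming
    return f, degree

rng = np.random.default_rng(42)
graph = preferential_attachment_graph(200, 2, rng)
f_final, degree = diffuse(graph, 20, rng)
correlation = np.corrcoef(np.log(degree), np.log(f_final))[0, 1]
print(f"log-log correlation between degree and mass: {correlation:.2f}")
```

The printed correlation gives a quick check of whether well-connected nodes accumulate more mass in this toy setting; studying the saturation of the neighbour activity would require longer runs and averaging, which is left out here.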

Interestingly, the model requirements (i.e., a diffusion process on a heterogeneous network) are 
met by software projects. Empirical studies of software maintenance reported that change 
propagation is a central feature of software maintenance [10]. Propagation is necessary because 
there are dependencies between project files and developers must ensure that related files are 
changed in a consistent way. Recall that the software network $G$ captures these file dependencies 
(see above). The software network displays a scale-free structure due to extensive reuse during 
software development [12]. 

Furthermore, our measurements on real OSS projects seem consistent with model predictions. 
We have observed that, for all software projects analyzed here, the probability 
distribution $P_>(f_i)$ has a long tail. For example, power-law fitting for the Apache project 
yields an exponent $-1 - \mu \approx -2.04$ for the incoming flux distribution (see fig. 3A; the 
cumulative probability distribution is used to reduce the impact of noisy fluctuations). As 
hypothesized above, the plot in fig. 3B shows that key files having a large number of dependencies 
are changed more frequently. We have checked that $\langle k_{nn} \rangle$ is a function of 
$\langle k \rangle$ (not shown here). As seen in fig. 3C, the average neighbour activity 
$\langle f_{nn} \rangle$ increases with average node activity $\langle f \rangle$ and it is 
approximately constant, $\langle f_{nn} \rangle \approx \mathrm{const}$, for $\langle f \rangle > f_0$ 
with $f_0 = 3.16 \times 10^{-3}$. This value of $f_0$ is consistent with the observation made in 
fig. 2A. In this case, eq. (9) predicts $\langle f \rangle \sim \langle k \rangle^{a}$ 
with $a = 1$, to be compared with the measured exponent 0.72 (see fig. 3B). 
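
The linear prediction can be made explicit with a short calculation. The lines below are a sketch under two assumptions that are not spelled out in the text: the mean-field equation is read as $df_i/dt = (k_i / \langle k_{nn} \rangle) \langle \eta_j \rangle \langle f_{nn} \rangle$, and $\langle k_{nn} \rangle$ is treated as independent of $k_i$. The constant $c$ denotes the saturated neighbour activity.

```latex
% Saturated regime: <f_nn> = c is constant for <f> > f_0 (assumption stated above).
\frac{d f_i}{dt}
  = \frac{k_i}{\langle k_{nn} \rangle}\, \langle \eta_j \rangle\, \langle f_{nn} \rangle
  = \frac{\langle \eta_j \rangle\, c}{\langle k_{nn} \rangle}\, k_i
\quad\Longrightarrow\quad
f_i(t) = f_i(0) + \frac{\langle \eta_j \rangle\, c}{\langle k_{nn} \rangle}\, k_i\, t .
% For large t the initial condition is negligible, so f_i grows in proportion to k_i.
```

Under these assumptions the activity of a node is proportional to its degree at long times, which is the exponent of 1 quoted in the text; any degree dependence of $\langle k_{nn} \rangle$ would change this value.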

Different subsystems display different scaling laws. - A practical application of fluctuation 
analysis is the identification of files that change together [11]. This suggests a method for 
community detection based on individual node dynamics. In our context, we have observed 
that some subsystems are characterized by different scaling laws in their internal fluctuations 
with average activity. For example, figure 4 summarizes the analysis of internal fluctuations 
in the software project TortoiseCVS. There are two clearly defined subsystems, the main 
application subsystem (dark balls) and the window library wxwin (white balls), characterized 
by different change dynamics (see fig. 4A). The crossover behaviour can be appreciated in 
the scaling of internal fluctuations for the main TortoiseCVS subsystem (the exponent for 
$\langle f \rangle > f_0$ is $\alpha_{int} \approx 0.85$, see fig. 4B). The main subsystem concentrates 
the largest fraction of changes. On the other hand, the crossover is not observed in the scaling 
for the wxwin subsystem (see fig. 4B), which is a utility library imported from an external 
development team. The minimal amount of activity regarding the wxwin subsystem (sporadic changes in 
the library communicated by the external team and minor adjustments required by the main 
subsystem) suggests an explanation for the absence of a crossover. 

$$\frac{d f_i}{dt} = \frac{k_i}{\langle k_{nn} \rangle} \langle \eta_j \rangle \langle f_{nn} \rangle = J(\langle k \rangle)\, \langle f_{nn} \rangle \qquad (9)$$

In short, we have provided empirical and theoretical evidence for a well-defined crossover 
in the dynamics of software change. This is the first reported example of such behaviour in 
a large-scale technological system. It shows that OSS systems exhibit some traits in common 
with other complex networks. The presence of a crossover allows us to distinguish between internal 

Fig. 4 - Scaling of internal fluctuations in different subsystems of the TortoiseCVS software project. 
(A) Modular organization of the corresponding software network, where nodes represent files and 
links depict dependencies. Nodes within the same subsystem are displayed in the same colour. (B) 
Different scaling laws of internal subsystem fluctuations with average flux, 
$\sigma_i \sim \langle f_i \rangle^{\alpha_{int}}$, for the main application subsystem (black balls) 
and for the window subsystem (so-called wxwin, white balls). 

and external components of the dynamics, and thus provides a powerful approach to uncover 
the relative importance of exogenous versus endogenous dynamics. 

* * * 

Conclusions. - Sergi Valverde dedicates this paper to his daughter Violeta. We thank 
Ricard Solé and Damien Challet. This work has been supported by the EU within the 6th 
Framework Program under contract 001907 (DELIS). 